A Simple Model for Forecasting Housing Prices

We want to start simply, so we'll just look at a basic forecast of the national Case-Schiller index based on population, and then add a few other variables. Schiller provides the data for this on the webpage for his book Irrational Exuberance.

We'll use pymc, a well-developed and maintained python library for bayesian statistical analysis, especially Markov Chain Monte Carlo (MCMC).

Other libraries used here include pandas and numpy. matplotlib is used for plotting, and I use a custom configuration file based on the one used in Bayesian Programming for Hackers, a great book with tons of good information on bayesian methods and graphing.

So first we need to import modules:


In [28]:
#Modules
import pymc as pm
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
%matplotlib inline

Then let's load our data. I've made a few quick changes to convert the original file named "Fig2-1.xls" so that it is easy to import:


In [208]:
#Data file
datafile = '/home/potterzot/data/irrational_exuberance/prices_population.csv'
data = pd.read_csv(datafile)

data.columns


Out[208]:
Index(['date', 'real_home_price_idx', 'date2', 'real_build_cost_idx', 'us_pop', 'long_rate', 'long_rate_src', 'date3', 'hpi', 'hpi_src', 'date4', 'nom_build_cost_idx', 'nom_build_cost_src', 'date5', 'cpi', 'cpi_src', 'date6', 'cpi_annual'], dtype='object')

Before we do anything, let's look at a few samples of what our data look like:


In [234]:
data.loc[0:10,:]


Out[234]:
date real_home_price_idx date2 real_build_cost_idx us_pop long_rate long_rate_src date3 hpi hpi_src date4 nom_build_cost_idx nom_build_cost_src date5 cpi cpi_src date6 cpi_annual
0 1890 100.000000 1890 51.362340 63.056 3.42 Homer 1890 3.657175 Grebler 1890 NaN Grebler 1890 7.611652 Warren&Pearson 1890 7.611652
1 1891 88.011791 1891 47.266163 64.361 3.62 Homer 1891 3.299213 Grebler 1891 NaN Grebler 1891 7.801942 Warren&Pearson 1891 7.801942
2 1892 95.421736 1892 52.048155 65.666 3.60 Homer 1892 3.358873 Grebler 1892 NaN Grebler 1892 7.326213 Warren&Pearson 1892 7.326213
3 1893 92.297385 1893 44.673332 66.970 3.75 Homer 1893 3.502058 Grebler 1893 NaN Grebler 1893 7.897091 Warren&Pearson 1893 7.897091
4 1894 123.980483 1894 57.263434 68.275 3.70 Homer 1894 4.080763 Grebler 1894 NaN Grebler 1894 6.850483 Warren&Pearson 1894 6.850483
5 1895 117.455092 1895 61.470338 69.580 3.46 Homer 1895 3.704903 Grebler 1895 NaN Grebler 1895 6.565052 Warren&Pearson 1895 6.565052
6 1896 100.302990 1896 60.068945 70.885 3.60 Homer 1896 3.209723 Grebler 1896 NaN Grebler 1896 6.660193 Warren&Pearson 1896 6.660193
7 1897 106.515703 1897 62.384893 72.189 3.40 Homer 1897 3.311145 Grebler 1897 NaN Grebler 1897 6.469903 Warren&Pearson 1897 6.469903
8 1898 110.184140 1898 61.438038 73.494 3.35 Homer 1898 3.525922 Grebler 1898 NaN Grebler 1898 6.660193 Warren&Pearson 1898 6.660193
9 1899 103.853113 1899 64.044606 74.799 3.10 Homer 1899 3.370805 Grebler 1899 NaN Grebler 1899 6.755342 Warren&Pearson 1899 6.755342
10 1900 101.574295 1900 49.420634 76.094 3.15 Homer 1900 3.854054 Grebler 1900 NaN Grebler 1900 7.897091 Warren&Pearson 1900 7.897091

In [265]:
#Plot real house prices and population versus date
z = np.polyfit(data['date'],data['real_home_price_idx'],1)
p = np.poly1d(z)
figsize(12, 4)
plt.title("National Real Home Price Index and Population")
plt.plot(data.loc[data['date']>=1960, 'date'], data.loc[data['date']>=1960,'real_home_price_idx'], label="CS Index")
plt.plot(data.loc[data['date']>=1960, 'date'], p(data.loc[data['date']>=1960, 'date']), color='grey', linewidth=1)
plt.plot(data.loc[data['date2']>=1960, 'date2'], data.loc[data['date2']>=1960, 'us_pop'], label="Population")
plt.ylim(0,350)
legend = plt.legend()
legend.get_frame().set_alpha(0.4)
plt.tight_layout()
plt.show()


Notice that, although the scale here is scrunched together pretty badly, it seems like the population and the trendline for housing prices are fairly similar. Let's look at that more closely.


In [266]:
plt.clf()
figsize(12, 4)
plt.plot(data['date'], p(data['date']), label="Housing Price Trendline")
plt.plot(data['date2'], data.loc[:,'us_pop']/10+68, label="Population") #Here we scale and center the population
plt.title("Housing Price Trendline and US Population, Scaled and Centered")
plt.ylim(0,150)
legend = plt.legend()
legend.get_frame().set_alpha(0.4)
plt.tight_layout()
plt.show()


We can see that housing prices have increased somewhat more quickly than population, so population may not be a great predictor, but it's not a terrible place to start with a simple model. It makes sense that as population increases, the price of housing might also be expected to increases, especially in spatially-constrained geographies like dense urban-areas.

We'll refine the model substantially later, but starting with a simple model will allow us to get the basic framework of the code down. It can always be made more complicated, but starting with something complicated can make development quite a bit slower.

Setting up a Forecasting Model

We'll use a bayesian approach to the forecast, in an attempt to estimate a percent change in price in each year with a sense of how confident we are in the estimate.

Our prior can just be the average annual percent growth from 1890 to 2014.


In [258]:
#Set up an annual data data file
adata = pd.DataFrame()
adata['date'] = data.loc[data['date']<=1950, 'date']
adata['rhpi'] = data.loc[data['date']<=1950, 'real_home_price_idx']
adata['pop'] = data.loc[data['date']<=1950, 'us_pop']

#Quarterly data
#qdata = pd.DataFrame()
#qdata['date'] = data.loc[data['date']>=1960, 'date']
#qdata['rhpi'] = data.loc[data['date']>=1960, 'real_home_price_idx']
#qdata['year'] = np.floor(qdata['date'])
#qdata['pop'] = data.loc[data['date2']>=qdata['year'], 'us_pop']
#qdata

#Average annual change
last_year = adata.loc[adata.index[-1],'date']
first_year = adata.loc[0, 'date']
n_years = last_year-first_year
rhpi = adata['rhpi']
delta_date = adata['date'].diff()
delta_rhpi = rhpi.diff()[1:]
avg_annual_chg = delta_rhpi.sum()/(n_years)
pct_chg = delta_rhpi / np.array(rhpi[0:-1])

#Compound rate of change

avg_rate_chg = (rhpi[rhpi.index[-1]]/rhpi[0])**(1/(n_years)) - 1
{'actual': rhpi[rhpi.index[-1]], 'estimate': rhpi[0]*(1+avg_rate_chg)**(n_years)} #check that our estimate is correct


Out[258]:
{'estimate': 105.89483934250032, 'actual': 105.89483934250001}

In [319]:
#Plot the year to year % changes
plt.clf()
figsize(12, 4)
plt.title("Percent change from period to period")
plt.bar(adata.loc[1:,'date'], pct_chg*100)
locs,labels = plt.yticks()
plt.yticks(locs, list(map((lambda x: "{}%".format(int(x))), locs)))
plt.show()


But what's the distribution of these percent changes? If we want to capture the way it tends to move, let's look at a histogram of the percent change:


In [407]:
#mean and standard deviation
m = pct_chg.mean()
sd = 10/(np.std(pct_chg))

# Create a normal distribution
p = pm.Normal('p', m, sd)
mdl_pct_chg = np.array([p.random() for i in np.arange(0,10000)])

#
plt.clf()
figsize(12, 4)
hist1, bins1 = np.histogram(pct_chg, bins=15)
hist2, bins2 = np.histogram(mdl_pct_chg, bins=15)
width = 0.7 * (bins[1] - bins[0])
center1 = (bins[:-1] + bins[1:]) / 2
center2 = (bins2[:-1] + bins2[1:]) / 2
plt.bar(center1, hist1, align='center', width=width)
plt.plot(center2, hist2*len(pct_chg)/10000)
plt.title("Histogram of Percent Change")
plt.show()



In [398]:
#Now create a MCMC model...
#@pm.deterministic
#def m_():
#def sd_():
m_ = pm.Normal('m', 0, 1)
sd_ = pm.Normal('sd', 0, 1)

observations = pm.Normal("obs", m, sd, value=pct_chg, observed=True)
model = pm.Model([observations, m_, sd_])

In [399]:
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)


 [-----------------100%-----------------] 40000 of 40000 complete in 3.7 sec

In [437]:
#Get the traces, predicted points of the shape of m and sd
m_samples = mcmc.trace('m')[:]
sd_samples = mcmc.trace('sd')[:]

# histogram of the samples:
plt.clf()
figsize(12.5, 8)

ax = plt.subplot(211)
ax.set_autoscaley_on(False)

plt.hist(m_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\mu$", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
    $\mu,\;\tau$""")
plt.xlim([-3, 3])
plt.xlabel("$\mu$ value")

ax = plt.subplot(212)
ax.set_autoscaley_on(False)
plt.hist(sd_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\tau$", color="#7A68A6", normed=True)
plt.legend(loc="upper left")
plt.xlim([-3, 3])
plt.xlabel("$\tau$ value")

plt.show()


That probably doesn't seem like much, except that now we have an estimate of the two variables that we think are affecting the underlying percent change in housing price, assuming the change is just a random variable from a normal distribution.

But that's not really very interesting. That just says that prices change randomly as defined by our normal distribution. What we hope is that we can determine a series of factors that affect the change. Let's begin by assuming that the change in price is dependent on the change in population in the previous year.